AN EFFICIENT IMPLEMENTATION OF N-POINT DCT, N-POINT IDCT,
SA-DCT AND SA-IDCT ALGORITHMS



FIELD

This invention relates to an implementation of algorithms for
multimedia compression and decompression, more particularly, efficient
implementation of n-point discrete cosine transform, n-point inverse discrete
cosine transform, shape adaptive discrete cosine transform, and shape
adaptive inverse discrete cosine transform algorithms using SIMD operations,
MMX™ instructions, VLSI implementation, single processor
implementation or vector processing.

BACKGROUND

Computer multimedia applications typically involve the processing of
high volumes of data values representing audio signals and video images.
Processing the multimedia data often includes performing transform coding,
which is a method of converting the data values into a series of transform
coefficients for more efficient transmission, computation, encoding,
compression, or other processing algorithms.

More specifically, the multimedia data values often represent a signal
as a function of time. Transform coefficients represent the same signal as a
function, for example, of frequency. There are numerous transform
algorithms, including the fast Fourier transform (FFT), the discrete cosine
transform (DCT), and the Z transform. Corresponding inverse transform
algorithms, such as an inverse discrete cosine transform (IDCT), convert
transform coefficients to sample data values. Many of these algorithms
include multiple mathematical steps that involve decimal numbers.

In an effort to allow for easy interchange of graphical formats, the
International Standards Organization (ISO) and the Consultative Committee
for International Telegraph and Telephone (CCITT) formed the Joint
Photographic Experts Group (JPEG) and the Moving Pictures Expert Group
(MPEG). The JPEG/MPEG committees published compression standards that
use the Discrete Cosine Transform (DCT) algorithm to convert a graphics
image to the frequency domain. Efficient implementations of the DCT
algorithm are of interest since JPEG and MPEG algorithms strive to achieve
real-time performance. Most multimedia development software that uses
this type of compression depends on the use of a coprocessor to perform
compression.

DCT is widely used in one dimensional (1D) and two dimensional (2D)
signal processing. 2D 8x8 DCT is the basis for JPEG and MPEG compression.
While there are presently algorithms that directly compute 2D 8x8 DCT,
taking the 8-point 1D transform of the rows and the columns is equivalent to
the 2D 8x8 transform. There exist algorithms that compute 1D 8-point DCT
which can be used in the row-column method to perform a 2D 8x8 DCT.
BRIEF DESCRIPTION OF THE DRAWINGS



Additional advantages of the invention will become apparent upon
reading the following detailed description and upon reference to the
drawings, in which:

Fig. 1 is a block diagram depicting multimedia compression and 
decompression, in another embodiment of the present invention; 

Fig. 2 depicts a bounding box and macroblocks of an arbitrary shaped 
video object in another embodiment of the present invention; 

Figs. 3a-3e depict the SA-DCT baseline algorithm for coding an
arbitrarily shaped image segment contained within an 8x8 block, in another 
embodiment of the present invention; 

Fig. 4 depicts one embodiment of video compression; 

Fig. 5 depicts one embodiment of video decompression; 

Fig. 6 depicts one embodiment of SA-DCT; 

Fig. 7 depicts one embodiment of SA-IDCT; 

Fig. 8 depicts one embodiment of Single Instruction-Multiple-Data 
(SIMD); 

Fig. 9a depicts one embodiment of n-point DCT/IDCT; 

Fig. 9b depicts a factored embodiment of n-point DCT/IDCT. 



Exemplary embodiments are described with reference to specific
configurations. Those skilled in the art will appreciate that various changes
and modifications can be made while remaining within the scope of the
claims.

Multimedia extension (MMX™) is used to implement SIMD 
operations. Existing algorithms do not reduce the clock cycle count of the 
implementation in MMX™ although they minimize the number of addition 
and multiplication operations. In another embodiment, the PMADDWD 
instruction used in existing algorithms multiplies and adds, making it 
unworkable to obtain four discrete 32-bit values from four sets of 16-bit 
multiplies. The present invention reduces processor time by having 
operations done with minimal PMADDWD instructions. 

The invention provides an efficient implementation of n-point 
discrete cosine transform, n-point inverse discrete cosine transform, shape 
adaptive discrete cosine transform (SA-DCT) and shape adaptive inverse 
discrete cosine transform (SA-IDCT) algorithms for multimedia compression 
and decompression optimization. An n-point DCT function is represented by 
a first equation having an input matrix, an output matrix and a matrix of 
predetermined values. An n-point IDCT function is represented by a second 
equation having an input matrix, an output matrix and a matrix of 
predetermined values. The multiplication operations within the matrix of 
predetermined values are paired, thereby reducing processor instructions. In 
another embodiment, SIMD operations are used to perform the algorithms. 
In another embodiment, MMX operations, being one type of SIMD operation,
are used to perform the algorithms. In another embodiment, vector processing
is used to perform the algorithms. In another embodiment, a single processor
implementation is used to perform the algorithms. In yet another
embodiment, a VLSI implementation is used to perform the algorithms.

In an embodiment, a machine-readable storage medium is provided having
executable instructions which, when executed by a processor, implement
n-point discrete cosine transform (n-point DCT) algorithms, n-point inverse
discrete cosine transform (n-point IDCT) algorithms, shape adaptive discrete
cosine transform (SA-DCT) algorithms and shape adaptive inverse discrete
cosine transform (SA-IDCT) algorithms for multimedia compression and
decompression. A machine-readable storage medium includes
any mechanism that provides (i.e., stores and/or transmits) information in a
form readable by a machine (e.g., a computer). For example, a
machine-readable medium includes read only memory (ROM); random access memory
(RAM); magnetic disk storage media; optical storage media; flash memory
devices; electrical, optical, acoustical or other form of propagated signals (e.g.,
carrier waves, infrared signals, digital signals, etc.); etc.

In another embodiment, once the video signal has been stored as data
in the computer system memory, the data is manipulated at compression
stage 6 of FIG. 1, which may include compressing the data into a smaller
memory space. By occupying a smaller memory space, the video
signal is more easily stored or transmitted because there is less data to store or
transmit, requiring less processing power and fewer system resources. Video signal
16, stored in memory registers of the computer system, is directed to
compression stage 6. In the spatial domain, video signal 16 is represented by a
waveform in which the amplitude of the signal is indicated by vertical
displacement while time or space is indicated by horizontal displacement.

For many compression methods it is desirable to transform a signal 
from the spatial domain to another domain, such as the frequency domain, 
before analyzing or modifying the signal. After video signal 16 is received at 
compression stage 6, the signal is transformed from the spatial domain to the 
frequency domain. In the frequency domain, the amplitude of a particular 
frequency component (e.g. a sine or cosine wave) of the original signal is 
indicated by vertical displacement while the frequency of each frequency 
component of the original signal is indicated by horizontal displacement. The 
video waveform 16 is illustrated in the frequency domain at step 18 within 
compression stage 6. 

In another embodiment, transformation of a signal from the spatial to 
the frequency domain involves performing a Discrete Cosine Transform of 
the data elements representing the signal. For example, in accordance with 
some JPEG and MPEG standards, square subregions of the video image, 
generally an 8 x 8 array of pixels, are transformed from the spatial domain to 
the frequency domain using a discrete cosine transform function. This 8x8 
array of pixels corresponds to 8x8 data elements, each data element 
corresponding to the value (e.g. color, brightness, etc.) of its associated pixel in 
the 8x8 array. For another embodiment, other transform functions are 
implemented such as, for example, a Fourier transform, a fast Fourier 
transform, a fast Hartley transform, or a wavelet transform. 
In another embodiment of the present invention, the signal is
reconverted back into the spatial domain by applying an inverse transform to 
the data. Alternatively, the signal remains in the frequency domain and is 
transformed back into the spatial domain during the decompression stage, as 
described below. 

Upon receiving the compressed video signal at receiving stage 10, the 
data associated with the signal is loaded into computer system memory. In 
addition, if the video signal is encrypted, it is decrypted here. At 
decompression stage 12, the signal is decompressed by a method including, for 
example, applying an inverse transform to the data to translate the signal back 
into the spatial domain. This assumes the signal has been transmitted in a 
compressed format in the frequency domain from computer system 24. For an 
embodiment in which the compressed video signal is transmitted in the 
spatial domain, application of an inverse transform during the 
decompression stage may not be necessary. However, decompression of a 
video signal may be more easily accomplished in the frequency domain, 
requiring a spatial domain signal received by decompression stage 12 to be 
transformed into the frequency domain for decompression, then back into the 
spatial domain for display. 

Once decompressed, the signal is transferred to display stage 14, which 
may comprise a video RAM (VRAM) array, and the image is displayed on 
display device 30. Using this technique, a user at computer system 24 can 
transmit a video image to computer system 26 for viewing at the second 
computer terminal. Similarly, computer system 26 may have similar video 
and audio transmission capabilities (not shown), allowing display and audio
playback on display device 28 and speakers 32, respectively, of computer 
system 24. In this manner, applications such as video conferencing are 
enabled. 

As shown in Figure 4, SA-DCT can be used in one embodiment of 
video compression devices 490. Motion estimation 410 and motion 
compensation 420 can remove the temporal redundancy in the pictures. SA- 
DCT 430 can remove the spatial redundancy by transforming "time-domain" 
information into "frequency-domain" information with help from 
Quantization 440. Variable Length Encoder (VLC) 450 compresses the 
frequency-domain data into bits. Inverse Quantization 460, SA-IDCT 470, and 
motion compensation 480 are used to improve the encoding quality. 

As shown in Figure 5, SA-IDCT can be used in one embodiment of 
video decompression devices 560. VLD 510 and Inverse Quantization 520 
reverse bits into frequency-domain data. SA-IDCT 470 reverses the frequency- 
domain data into spatial domain data. Motion compensation 540 reconstructs 
the images 550 and 570. 

As shown in Figure 6, n-point DCT can be used in one embodiment of 
SA-DCT 430. First, the data is shifted in the vertical direction 432 (Figure 3b). 
Second, n-point DCT is performed column by column 434 (Figure 3c). The 
data is shifted in the horizontal direction 436 (Figure 3d). As shown in figure 
3e, n-point DCT is performed row by row 438. 

As shown in Figure 7, n-point IDCT can be used in one embodiment of
SA-IDCT 470. N-point IDCT is performed row by row 472. Data is shifted in
the horizontal direction 474. N-point IDCT is performed column by column
476. The data is shifted in the vertical direction 478.

As shown in Figure 8, SIMD uses a single instruction to operate on
multiple data. 64-bit data 820 contains 16-bit data 822, 824, 826, and 828. 64-bit
data 840 contains 16-bit data 842, 844, 846, and 848. PMADDWD (one of
the MMX instructions, which are one type of SIMD instruction) is used to add the
multiplication result of 822 and 832 to the multiplication result of 824 and
834, as well as to add the multiplication result of 826 and 836 to the
multiplication result of 828 and 838.
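
The multiply-and-add behavior described for PMADDWD can be modeled in portable C. The following is only an illustrative sketch: the function name pmaddwd_model and the lane ordering (index 0 is the least significant 16-bit word) are assumptions made for this example and do not come from the figures.

```c
#include <stdint.h>

/* Scalar model of PMADDWD on one pair of 64-bit operands.
 * a[0..3] and b[0..3] are the four signed 16-bit lanes of each operand,
 * lane 0 being the least significant word (an assumption of this sketch).
 * out[0] receives the low 32-bit sum, out[1] the high 32-bit sum. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4],
                          int32_t out[2])
{
    /* Accumulate in 64 bits so the C arithmetic itself cannot overflow. */
    int64_t lo = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    int64_t hi = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];

    /* Keep only the low 32 bits of each sum; the conversion of an
     * out-of-range value to int32_t is implementation-defined in C. */
    out[0] = (int32_t)(uint32_t)lo;
    out[1] = (int32_t)(uint32_t)hi;
}
```

Four 16-bit multiplications and two 32-bit additions are thus expressed by one instruction, which is why the listings later in this description arrange their constants in pairs.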

As shown in Figure 9, matrix multiplication can be used for n-point
DCT/IDCT 434, 438, 472, and 476. For n-point DCT, input [X] is the
time-domain data 930 and output [Y] is the frequency-domain data 910. In one
embodiment, matrix [A] is factored into [S][M][B], whereby the number of
multiplications is reduced. One embodiment of this invention uses SIMD
operations for n-point DCT/IDCT. As applied to DCT, matrix 910 represents
frequency domain data, and matrix 930 represents time domain data. As
applied to IDCT, matrix 910 represents time domain data, and matrix 930
represents frequency domain data.

JPEG lossy compression algorithms operate in three successive stages:
DCT transformation, coefficient quantization, and lossless compression. DCT
belongs to a class of mathematical operations that includes the Fast Fourier
Transform (FFT). The basic operation performed by these transforms is to
transform a signal from one type of representation to another. DCT is used for
compression and IDCT is used for decompression. During compression, DCT
transforms a set of points from the spatial domain into a representation in the
frequency domain. During decompression, an IDCT function converts the spectral
representation of the signal back to a spatial one. The formulas for the DCT
and IDCT are shown in Table 1 and Table 2, respectively.



$C(x) = \frac{1}{\sqrt{2}}$ if $x = 0$, else $C(x) = 1$ if $x > 0$

Table 1

$C(x) = \frac{1}{\sqrt{2}}$ if $x = 0$, else $C(x) = 1$ if $x > 0$

Table 2



One embodiment of the DCT algorithm is performed on an N x N
square matrix of pixel values, and it yields an N x N square matrix of
frequency coefficients. DCT performs a matrix multiplication of the input
pixel data matrix by the transposed cosine transform matrix and stores the
result in a temporary N by N matrix. The temporary matrix is multiplied by
the cosine transform matrix, and the result is stored in the output matrix.

The DCT computation complexity is simplified by factoring out the 
transformation matrix into butterfly and shuffle matrices. The butterfly and 
shuffle matrices can be computed with fast integer addition, the resulting 
zeroes in the original matrix being trivial to compute. In most of the fast DCT 
algorithms, optimization usually focuses on reducing the number of DCT 
arithmetic operations, especially the number of multiplications. 

IDCT essentially uses the reverse of the operations performed in the 
DCT. In one embodiment, the DCT values in the N by N matrix are 
multiplied by the cosine transform matrix. The result of this transformation 
is stored in a temporary N by N matrix. This matrix is then multiplied by the 
transposed cosine transform matrix. The result of this multiplication is 
stored in the output block of pixels. 

The MPEG-4 video coding standard supports arbitrary-shape video 
objects in addition to the conventional frame-based functionalities in MPEG-1 
and MPEG-2. Thus, in MPEG-4, the video input is no longer considered as a 
rectangular region. One of the building blocks for MPEG-4 video coding 
standard version 2 is the shape-adaptive-DCT (SA-DCT) for arbitrary shape 
objects. In an MPEG-4 image, there are contour macroblocks which contain 
the shape edge of an object, as shown in Fig. 2. Instead of performing an 8x8 
DCT after filling the non-object pixels, the new standard adaptively performs 
N-point DCT based on the shape. For contour macroblocks, only object pixels 
are transformed into DCT domain. The procedure of transforming only
object pixels into DCT domain is called shape-adaptive DCT. In one
embodiment, this invention optimizes SA-DCT and SA-IDCT for the MPEG-4
object-based coding scheme using platform-dependent knowledge. Compared
to 8x8-DCT, SA-DCT provides a significantly better rate-distortion trade-off,
especially at high bit rates.

Standard 8x8 DCT is applied to 8x8 blocks with 64 opaque pixels. In 8x8
blocks that straddle the boundaries of a VOP, standard DCT is replaced by
shape adaptive DCT (SA DCT). These boundary blocks have arbitrary shape
with at least one transparent pixel, so that the number of opaque pels is less
than 64.

Similar to standard DCT, forward and inverse SA DCT convert
pixel(x,y) to DCT(i,j) and vice versa. SA DCT also keeps all conditions on the
internal precision of floating point arithmetic as well as the rounding to
integers and the dynamic ranges of pixel(x,y) and DCT(i,j) stated for 8x8 DCT.
In contrast to standard 8x8 DCT, the internal processing of SA DCT is
controlled by shape parameters, which are derived from the decoded VOP
shape. Only the opaque pixels within the boundary blocks are transformed
and coded. As a consequence, SA DCT does not require the padding
technique if shape coding is lossless, and the number of resulting SA DCT
coefficients is identical to the number of opaque pixels in the given boundary
block.

Figures 3a-3e depict the SA-DCT baseline algorithm for coding an
arbitrarily shaped image segment contained within an 8x8-block. The
SA-DCT algorithm is based on predefined orthonormal sets of DCT basis
functions. The forward 2D SA-DCT first applies 1D DCT transformation to
columns, and then to rows. The inverse 2D SA-DCT applies the 1D IDCT
transform first to rows, then to columns. Figure 3a depicts an image block
segmented into two regions, foreground as shown in gray and background as
shown lighter. To perform the vertical transform of the foreground, the
length (vector size N, 0<N<9) of each column j (0<j<9) of the foreground
segment is calculated. As depicted in Figure 3b, the columns are shifted and
aligned to the upper border of the 8x8 reference block.

Depending on the vector size N of each particular segment
column, a 1D N-point DCT (a transform kernel containing a set of N basis
vectors, DCT-N) is selected for each particular column and applied to the first
N column pixels. For example, as depicted in Figure 3b, the rightmost
column is transformed using 3-point DCT. As depicted in Figure 3d, before
the horizontal DCT transformation, the rows are shifted to the left border of
the 8x8 reference block. Figure 3e depicts the final location of the resulting
DCT coefficients within an 8x8-image block.
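
The vertical shift and the per-column length calculation can be sketched in C as shown here. This is an illustration only: the function name sa_shift_columns, the mask representation (nonzero marks an opaque pixel) and the zero fill of unused positions are assumptions of this example, not details taken from the figures.

```c
#include <stdint.h>

#define BLK 8

/* Shift the opaque pixels of every column to the upper border of the
 * 8x8 block and record the length N of each column.
 * mask[y][x] != 0 marks an opaque (foreground) pixel. */
static void sa_shift_columns(int16_t pix[BLK][BLK], uint8_t mask[BLK][BLK],
                             int16_t shifted[BLK][BLK], int col_len[BLK])
{
    for (int x = 0; x < BLK; x++) {
        int n = 0;
        for (int y = 0; y < BLK; y++) {
            if (mask[y][x])
                shifted[n++][x] = pix[y][x];  /* pack toward the top */
        }
        col_len[x] = n;                       /* selects the n-point DCT */
        for (int y = n; y < BLK; y++)
            shifted[y][x] = 0;                /* unused positions (assumed) */
    }
}
```

A column with col_len[x] equal to 3, for instance, would then be passed to a 3-point DCT, and the horizontal pass would repeat the same packing along rows toward the left border.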

The final number of DCT coefficients is identical to the number of
pixels contained in the image segment. Additionally, the coefficients are
located in comparable positions as in a standard 8x8 block. The DC coefficient
is located in the upper left border of the reference block and is dependent on
the actual shape of the segment. The remaining coefficients are concentrated
around the DC coefficient. Since the contour of the segment is transmitted to
the receiver prior to transmitting the macroblock information, the decoder
performs the shape-adapted inverse DCT as the reverse operation in both
horizontal and vertical segment direction on the basis of decoded shape data.
The 1D N-point DCT is computed by the following equation:



$$y_n = c_n \sum_{k=0}^{N-1} x_k \cos\left(\frac{n(2k+1)\pi}{2N}\right)$$

where $c_0 = \frac{1}{\sqrt{N}}$ and $c_n = \sqrt{\frac{2}{N}}$ for $n = 1, \ldots, N-1$

Table 3



The computation of the 2-point DCT can be simplified as follows: 
Table 4

In conventional algorithmic optimization, the number of additions and
multiplications is minimized. Thus,



$$z_0 = x_0 + x_1, \qquad z_1 = x_0 - x_1$$

$$y_0 = \frac{1}{\sqrt{2}} z_0, \qquad y_1 = \frac{1}{\sqrt{2}} z_1$$

Table 5



In this way, only two additions and two multiplications are performed, 
instead of two additions and four multiplications. The following C code for 
this algorithm is currently used. 

void fsadct2_float (float in[2], float out[2])
{
    static float f0 = 0.707107f;   // 1/sqrt(2)

    out[0] = (in[0] + in[1]) * f0;
    out[1] = (in[0] - in[1]) * f0;
}

In one embodiment of the invention, using MMX™ and Streaming 
SIMD Extensions, two additions and four multiplications can be performed 
quickly with only one PMADDWD instruction for the 2-point DCT as follows: 

void fsadct2_mmx (short in[2], short out[2])
{
    static __int64 xstatic1 = 0xA57E5A825A825A82;  // -f0 f0 f0 f0
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        mov ecx, out
        movd mm0, [eax]               // mm0 = xx, xx, i1, i0
        pshufw mm1, mm0, 01000100b    // mm1 = i1, i0, i1, i0
        pmaddwd mm1, xstatic1         // mm1 = i0*f0 - i1*f0, i0*f0 + i1*f0
        paddd mm1, rounding           // do proper rounding
        psrad mm1, 15
        packssdw mm1, mm7             // mm1 = x, x, o1, o0
        movd [ecx], mm1
    }
}
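
The fixed-point arithmetic carried out by the 2-point MMX™ routine can be mirrored in portable C for checking purposes. The following sketch assumes that 0x5A82 is the Q15 representation of 1/sqrt(2) and that 0x4000 is the rounding term, as in the listing; the names fsadct2_fixed and saturate16 are hypothetical, and an arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

#define F0_Q15     0x5A82   /* 1/sqrt(2) in Q15, as in xstatic1 */
#define ROUND_Q15  0x4000   /* one half in Q15, as in rounding  */

/* Clamp a 32-bit value to the signed 16-bit range, as PACKSSDW does. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of the 2-point DCT in Q15 fixed point.
 * The two sums correspond to the two 32-bit halves that PMADDWD
 * produces; >> on a negative value is assumed to be arithmetic. */
static void fsadct2_fixed(const int16_t in[2], int16_t out[2])
{
    int32_t sum  = (int32_t)in[0] * F0_Q15 + (int32_t)in[1] * F0_Q15;
    int32_t diff = (int32_t)in[0] * F0_Q15 - (int32_t)in[1] * F0_Q15;

    out[0] = saturate16((sum  + ROUND_Q15) >> 15);
    out[1] = saturate16((diff + ROUND_Q15) >> 15);
}
```

For the inputs 100 and 50, this model yields 106 and 35, which are the rounded values of 150/sqrt(2) and 50/sqrt(2).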



The computational complexity of the 3-point DCT can be simplified as 
follows: 
Table 6

In one embodiment of the invention, the following is the MMX™ code
implementation of the 3-point DCT algorithm:

void fsadct3_mmx (short in[3], short out[3])
{
    static __int64 xconst1 = 0x0000000049E749E7;  // 0 0 f0 f0
    static __int64 xconst2 = 0x5A82A57E977D3441;  // f1 -f1 -f3 f2
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movd mm0, [eax]               // 0 0 i1 i0
        movd mm5, [eax+2]             // 0 0 i2 i1
        movq mm7, rounding

        pshufw mm4, mm0, 00111100b    // i0 0 0 i0
        pshufw mm3, mm5, 11010001b    // 0 i2 i1 i2
        paddsw mm4, mm3               // i0 i2 i1 i0+i2
        movq mm1, mm4

        pmaddwd mm4, xconst1          // 0   (i0+i1+i2)*f0 << 15
                                      //     o0 << 15
        pmaddwd mm1, xconst2          // (i0-i2)*f1 << 15   (i0+i2)*f2 - i1*f3
                                      // o1 << 15           o2 << 15

        paddd mm1, mm7                // do proper rounding
        paddd mm4, mm7                // do proper rounding
        psrad mm1, 15                 // o1 o2
        psrad mm4, 15                 // o0
        packssdw mm1, mm7             // x x o1 o2
        mov eax, out
        pshufw mm2, mm1, 11110001b    // x x o2 o1
        packssdw mm4, mm7             // x x x o0
        movd [eax], mm4               // save o0
        movd [eax+2], mm2             // save o1, o2
    }
}



The 4-point DCT can be computed as shown in Table 7. Multiplication
operations can be paired (or grouped) within the matrix.

Table 7

The above 4-point DCT can be further written as:

Table 8

Even if the upper left cosine block in the original matrix is further
factored, leaving two multiplication operations, two PMADDWD operations
would still be needed, plus a substantial number of additional instructions to
shuffle and add the results. Many existing algorithms do not reduce the clock
cycle count of the implementation in MMX™ although they minimize the
number of addition and multiplication operations. To reduce processor time
by having operations done with minimal SIMD operations (e.g.,
PMADDWD), the above 4-point DCT can be further written as:

Table 9

In one embodiment of the invention, the following is the MMX™
implementation of the 4-point DCT algorithm:
void fsadct4_mmx (short in[4], short out[4])
{
    static __int64 xstatic1 = 0x4000C00040004000;  // f0 -f0 f0 f0
    static __int64 xstatic2 = 0xDD5D539F22A3539F;  // -f1 f2 f1 f2
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        mov ecx, out
        movq mm0, [eax]               // i3 i2 i1 i0
        pshufw mm1, mm0, 00011011b    // i0 i1 i2 i3
        movq mm2, mm1
        paddsw mm2, mm0               // b0 b1 b1 b0
        psubsw mm0, mm1               // -b3 -b2 b2 b3
        pmaddwd mm2, xstatic1         // o2 << 15   o0 << 15
        pmaddwd mm0, xstatic2         // o3 << 15   o1 << 15
        paddd mm2, rounding           // do proper rounding
        paddd mm0, rounding           // do proper rounding
        psrad mm2, 15
        psrad mm0, 15
        packssdw mm2, mm0             // o3 o1 o2 o0
        pshufw mm3, mm2, 11011000b    // o3 o2 o1 o0
        movq [ecx], mm3
    }
}



The following matrix definitions are presented for illustrative purposes 
to define, or name, specific matrices in tables 7, 8 and 9. The values within 
the matrices defined below represent one embodiment of the invention. 



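$$\begin{bmatrix}
\frac{1}{\sqrt{4}} & \frac{1}{\sqrt{4}} & \frac{1}{\sqrt{4}} & \frac{1}{\sqrt{4}} \\
\sqrt{\frac{2}{4}}\cos\left(\frac{\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{3\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{5\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{7\pi}{8}\right) \\
\sqrt{\frac{2}{4}}\cos\left(\frac{2\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{6\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{10\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{14\pi}{8}\right) \\
\sqrt{\frac{2}{4}}\cos\left(\frac{3\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{9\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{15\pi}{8}\right) & \sqrt{\frac{2}{4}}\cos\left(\frac{21\pi}{8}\right)
\end{bmatrix}$$

= Matrix [A], as shown in Table 7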



As shown in Table 8, Shuffle Matrix [S] =

$$\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$



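$$\sqrt{\frac{2}{4}}
\begin{bmatrix}
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 \\
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\
0 & 0 & \cos\left(\frac{3\pi}{8}\right) & \cos\left(\frac{\pi}{8}\right) \\
0 & 0 & -\cos\left(\frac{\pi}{8}\right) & \cos\left(\frac{3\pi}{8}\right)
\end{bmatrix}$$

= Multiplication Matrix [M], as shown in Table 9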
As shown in Table 8, Butterfly Matrix [B] =

$$\begin{bmatrix}
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & -1 & 0 \\
1 & 0 & 0 & -1
\end{bmatrix}$$

Group 1 and Group 2 as shown below are presented for illustrative
purposes to define a part of matrix [M] in Table 8. The values below represent
one embodiment of the invention. Group 1 and Group 2 are "paired" or
"grouped". That is, the multiplication operations within matrix [M] of
predetermined values are paired, thereby reducing processor instructions.

Group 1 = $\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}$

Group 2 = $\begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix}$

The method described above can be provided in applications (e.g.,
video applications) to potentially increase the performance of the applications
by decreasing the time to perform n-point DCT, n-point IDCT, SA-DCT, and
SA-IDCT over known techniques. In one embodiment, the MMX™ versions
of the n-point DCTs performed from 1.3 to 3.0 times faster than fixed-point
versions. In one embodiment in which a complete and optimized
implementation of SA-DCT/SA-IDCT on Intel processors is demonstrated,
the speed of the SA-DCT/SA-IDCT process is increased by 1.1 to 1.5 times.

Also compared in table 10 is the performance of an MMX™ 8x8 
DCT/IDCT embodiment. 





                  Time (seconds after 10 million iterations)   Increase in speed when using MMX™
Transform  Points  Floating-Point   Integer   MMX™             From Floating Point   From Integer
DCT        2       1260             830       600              2.10                  1.38
DCT        3       1100             1040      770              1.42                  1.35
DCT        4       1430             1380      710              2.01                  1.94
DCT        5       1810             1700      1050             1.72                  1.61
DCT        6       2140             2030      1100             1.94                  1.84
DCT        7       4070             3020      1200             3.39                  2.51
DCT        8       4400             3460      1150             3.82                  3.00
IDCT       2       830              770       600              1.38                  1.28
IDCT       3       1100             1040      770              1.42                  1.35
IDCT       4       1540             1430      710              2.16                  2.01
IDCT       5       1920             1870      880              2.18                  2.12
IDCT       6       2310             2140      1210             1.90                  1.76
IDCT       7       3680             3460      1150             3.20                  3.00
IDCT       8       3740             2960      1260             2.96                  2.34

Table 10

Having disclosed exemplary embodiments, modifications and 
variations may be made to the disclosed embodiments while remaining 
within the spirit and scope of the invention as defined by the appended 
claims. 

The following code, shown in the Appendix, represents one
embodiment of the invention to implement the 5-point DCT, 6-point DCT,
7-point DCT, and 8-point DCT and the 2-point IDCT, 3-point IDCT, 4-point
IDCT, 5-point IDCT, 6-point IDCT, 7-point IDCT, and 8-point IDCT algorithms.



Appendix

void fsadct5_mmx (short in[5], short out[5])
{
    static __int64 xconst1 = 0x2F954CFEE6FC417E;  // f2 f1 -f4 f3
    static __int64 xconst2 = 0xB3022F95BE821904;  // -f1 f2 -f3 f4
    static __int64 xconst3 = 0x393E393E000050F4;  // f0 f0 0 f5
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                  // i3 i2 i1 i0
        pshufw mm1, [eax+2], 01011011b   // i2 i2 i3 i4
        movq mm2, mm0                    // i3 i2 i1 i0
        movq mm6, mm1                    // i2 i2 i3 i4
        paddsw mm0, mm1                  // x x b2 b0
        psubsw mm2, mm1                  // x x b3 b1
        punpcklwd mm0, mm2               // b3 b2 b1 b0

        pshufw mm1, mm0, 11011000b       // b3 b1 b2 b0
        movq mm2, mm1

        pmaddwd mm1, xconst1   // (b1*f1+b3*f2) << 15   (b0*f3-b2*f4) << 15
                               // o1                     o2 + b4
                               // SAVE
        movq mm5, rounding
        pmaddwd mm2, xconst2   // (b1*f2-b3*f1) << 15   (b0*f4-b2*f3) << 15
                               // o3                     o4 - b4
                               // SAVE

        pshufw mm4, mm0, 00001000b       // 0 0 b2 b0
        psllq mm4, 32                    // b2 b0 0 0
        psrlq mm6, 48                    // 0 0 0 i2
        pshufw mm3, mm6, 11001100b       // 0 i2 0 i2
        paddsw mm4, mm3                  // b2 b0+i2 0 i2
        pmaddwd mm4, xconst3             // (b2+b0+i2)*f0   i2*f5
                                         // (o0) << 15      (b4) << 15

        movq mm7, mm4
        paddd mm7, mm5                   // do proper rounding
        psrad mm7, 15                    // o0 x
        packssdw mm7, mm0                // x x o0 x
        psrlq mm7, 16
        mov eax, out
        movd [eax], mm7                  // store o0

        psllq mm4, 32
        psrlq mm4, 32                    // 0 (b4 << 15)
        psubd mm1, mm4                   // (o1 << 15) (o2 << 15)
        paddd mm2, mm4                   // (o3 << 15) (o4 << 15)
        paddd mm1, mm5                   // do proper rounding
        paddd mm2, mm5                   // do proper rounding
        psrad mm1, 15                    // x o1 x o2
        psrad mm2, 15                    // x o3 x o4
        packssdw mm1, mm2                // o3 o4 o1 o2
        pshufw mm0, mm1, 177             // o4 o3 o2 o1
        movq [eax+2], mm0
    }
}






void fsadct6_mmx (short in[6], short out[6])
{
static __int64 xconst1 = 0xCBBF344134413441;  // -f0 f0 f0 f0
static __int64 xconst2 = 0xB61924F34000C000;  // -f3 f2 f1 -f1
static __int64 xconst3 = 0x0000132034414762;  // 0 f4 f0 f5
static __int64 xconst4 = 0x00004762CBBF1320;  // 0 f5 -f0 f4
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax]               // i3 i2 i1 i0
movq mm1, [eax+4]             // i5 i4 i3 i2
xor eax, eax

movq mm7, rounding
pshufw mm2, mm1, 01011011b    // i3 i3 i4 i5
movq mm1, mm0
paddsw mm0, mm2
pinsrw mm0, eax, 3            // mm0 = 0 b2 b1 b0
psubsw mm1, mm2               // mm1 = 0 b5 b4 b3

pshufw mm6, mm0, 11111110b    // 0 0 0 b2
paddsw mm6, mm0               // 0 x b1 b0+b2
pshufw mm3, mm1, 11111011b    // 0 0 b5 0
paddsw mm3, mm1               // 0 x b4+b5 b3
psllq mm3, 32
pshufw mm2, mm6, 11110100b
paddsw mm2, mm3               // b4+b5 b3 b1 b0+b2
pmaddwd mm2, xconst1          // mm2 = o3 << 15 o0 << 15
pshufw mm3, mm0, 01100010b    // b1 b2 b0 b2
pshufw mm4, mm0, 11001111b    // 0 b0 0 0
paddsw mm3, mm4               // b1 b0+b2 b0 b2
pmaddwd mm3, xconst2          // mm3 = o4 << 15 o2 << 15

movq mm4, mm1
pmaddwd mm1, xconst3          // b5 * f4 << 15 (b4 * f0 + b5 * f4) << 15
pmaddwd mm4, xconst4          // b5 * f5 << 15 (-b4 * f0 + b3 * f4) << 15
pshufw mm5, mm1, 00001110b    // x x b5 * f4 << 15
pshufw mm6, mm4, 00001110b    // x x b5 * f5 << 15
paddd mm5, mm1                // mm5 = x o1 << 15
paddd mm6, mm4                // mm6 = x o5 << 15

paddd mm2, mm7                // do proper rounding
paddd mm3, mm7
paddd mm5, mm7
paddd mm6, mm7
psrad mm2, 15                 // x o3 x o0
psrad mm3, 15                 // x o4 x o2
psrad mm5, 15
psrad mm6, 15

mov eax, out
packssdw mm3, mm2             // o3 o0 o4 o2
pshufw mm2, mm3, 01110010b    // o4 o3 o2 o0
movq mm1, mm2
punpcklwd mm1, mm5            // x x o1 o0
movd [eax], mm1               // store o0, o1
psrlq mm2, 16                 // 0 o4 o3 o2
psllq mm6, 48                 // o5 0 0 0
paddsw mm2, mm6               // o5 o4 o3 o2
movq [eax+4], mm2             // store o2, o3, o4, o5
}
}
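The listings rely heavily on `pshufw` to reorder the four 16-bit words of an MMX register. As a reading aid, the following sketch emulates that instruction on a plain 64-bit integer; the function name `pshufw_emul` is hypothetical and is not part of the original listing.

```c
#include <stdint.h>

/* Emulate pshufw on a 64-bit value holding four 16-bit words.
   Bits [2k+1:2k] of imm select which source word is copied
   into destination word k (word 0 is the least significant). */
static uint64_t pshufw_emul(uint64_t src, unsigned imm)
{
    uint64_t dst = 0;
    for (int k = 0; k < 4; ++k) {
        unsigned sel = (imm >> (2 * k)) & 3u;          /* source word index */
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu; /* extract that word */
        dst |= word << (16 * k);                       /* place it in slot k */
    }
    return dst;
}
```

For example, with a source register holding i5 i4 i3 i2 (high to low), the immediate 01011011b used in fsadct6_mmx yields i3 i3 i4 i5, which agrees with the comment in the listing.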



void fsadct7_mmx (short in[7], short out[7])
{
static int f0_7 = 0x3061;
static int f1_7 = 0x42B4;
static int f2_7 = 0x3DA5;
static int f3_7 = 0x357E;
static int f4_7 = 0x2AA9;
static int f5_7 = 0x1DB0;
static int f6_7 = 0x0F39;
static int f7_7 = 0x446B;
static int b[7];

static __int64 xconst1 = 0x357E42B40F393DA5;  // f3 f1 f6 f2
static __int64 xconst2 = 0xE250357EC25B2AA9;  // -f5 f3 -f2 f4
static __int64 xconst3 = 0xBD4C1DB0D5570F39;  // -f1 f5 -f4 f6
static __int64 xconst4 = 0x0000446B30613061;  // 0 f7 f0 f0
static __int64 xconst5 = 0x00001DB00000D557;  // 0 f5 0 -f4
static __int64 xconst6 = 0x0000BD4C0000F0C7;  // 0 -f1 0 -f6
static __int64 xconst7 = 0x0000357E00003DA5;  // 0 f3 0 f2
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax]               // i3 i2 i1 i0
movq mm2, [eax+6]             // i6 i5 i4 i3
pextrw ecx, mm0, 3            // i3
pshufw mm1, mm2, 00011011b    // i3 i4 i5 i6
movq mm2, mm0
paddsw mm0, mm1               // x b4 b2 b0
psubsw mm2, mm1               // 0 b5 b3 b1
movq mm1, mm0
punpcklwd mm0, mm2            // b3 b2 b1 b0
punpckhwd mm1, mm2            // 0 x b5 b4
pshufw mm5, mm1, 11011100b    // 0 b5 0 b4
pshufw mm4, mm0, 11011000b    // b3 b1 b2 b0

movq mm1, mm4                 // b3 b1 b2 b0
movq mm2, mm4                 // b3 b1 b2 b0
movq mm3, mm4                 // b3 b1 b2 b0
pmaddwd mm1, xconst1          // (b1*f1+b3*f3) << 15 (b0*f2+b2*f6) << 15
pmaddwd mm2, xconst2          // (b1*f3-b3*f5) << 15 (b0*f4-b2*f2) << 15
pmaddwd mm3, xconst3          // (b1*f5-b3*f1) << 15 (b0*f6-b2*f4) << 15

movq mm6, mm5
pinsrw mm6, ecx, 2            // 0 i3 0 b4
paddsw mm0, mm6               // x b2+i3 x b4+b0
pshufw mm6, mm0, 00001000b    // x x b2+i3 b4+b0
movq mm0, rounding
pinsrw mm6, ecx, 2            // x i3 b0+b2 b4+i3
pmaddwd mm6, xconst4          // b6 << 15 o0 << 15
movq mm4, mm6

mov ecx, out
paddd mm6, mm0                // do proper rounding
psrad mm6, 15
packssdw mm6, mm1             // x x x o0
movd dword ptr [ecx], mm6     // save o0

movq mm6, mm5                 // x b5 x b4
movq mm7, mm5                 // x b5 x b4
pmaddwd mm5, xconst5          // (+b5*f5) << 15 (-b4*f4) << 15
pmaddwd mm6, xconst6          // (-b5*f1) << 15 (-b4*f6) << 15
pmaddwd mm7, xconst7          // (+b5*f3) << 15 (+b4*f2) << 15

paddd mm1, mm5                // o1 << 15 o2 << 15 + b6
paddd mm2, mm6                // o3 << 15 o4 << 15 - b6
paddd mm3, mm7                // o5 << 15 o6 << 15 + b6
psrlq mm4, 32
psubd mm1, mm4                // o1 << 15 o2 << 15
paddd mm2, mm4                // o3 << 15 o4 << 15
psubd mm3, mm4                // o5 << 15 o6 << 15
paddd mm1, mm0                // do proper rounding
paddd mm2, mm0                // do proper rounding
paddd mm3, mm0                // do proper rounding
psrad mm1, 15                 // o1 o2
psrad mm2, 15                 // o3 o4
psrad mm3, 15                 // o5 o6

packssdw mm2, mm1             // o1 o2 o3 o4
pshufw mm1, mm2, 27           // o4 o3 o2 o1
movq [ecx+2], mm1             // save o1, o2, o3, o4
packssdw mm3, mm7             // x x o5 o6
pshufw mm1, mm3, 00000001b
movd dword ptr [ecx+10], mm1  // save o5, o6
}
}
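Each routine narrows its 32-bit results back to 16-bit samples with `packssdw`, which saturates rather than wraps. The sketch below shows the same narrowing in scalar C; `sat16` and `packssdw_emul` are hypothetical helper names introduced only for illustration.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Emulate packssdw: the two dwords of dst become the low two result
   words and the two dwords of src become the high two result words. */
static void packssdw_emul(const int32_t dst[2], const int32_t src[2],
                          int16_t out[4])
{
    out[0] = sat16(dst[0]);
    out[1] = sat16(dst[1]);
    out[2] = sat16(src[0]);
    out[3] = sat16(src[1]);
}
```

Where a listing comment shows `x` in the packed result, the corresponding source dword is a don't-care value that a later store either skips or overwrites.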






void fsadct8_mmx (short in[8], short out[8])
{
static __int64 xconst1 = 0xA57E5A825A825A82;  // -f0 f0 f0 f0
static __int64 xconst2 = 0xD2BF2D412D412D41;  // -f4 f4 f4 f4
static __int64 xconst3 = 0xC4E0187D187D3B20;  // -f2 f6 f6 f2
static __int64 xconst4 = 0x3536238E0C7C3EC5;  // f3 f5 f7 f1
static __int64 xconst5 = 0xDC723536C13B0C7C;  // -f5 f3 -f1 f7
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax]               // i3 i2 i1 i0
movq mm1, [eax+8]             // i7 i6 i5 i4
pshufw mm2, mm1, 00011011b    // i4 i5 i6 i7
movq mm7, rounding
movq mm1, mm0
paddsw mm0, mm2               // mm0 = b[3] b[2] b[1] b[0] (A*)
psubsw mm1, mm2               // mm1 = b[4] b[5] b[6] b[7] (A*)

pshufw mm2, mm0, 00001011b    // x x b[2] b[3]
movq mm3, mm0
psubsw mm3, mm2               // mm3 = x x b2[2] b2[3] (B*)
paddsw mm2, mm0               // mm2 = x x b2[1] b2[0] (B*)
pshufw mm4, mm1, 00001100b    // mm4 = x x b2[4] b2[7] (B*)
pshufw mm0, mm1, 10011001b    // b[5] b[6] b[5] b[6]
pmaddwd mm0, xconst1          // b2[5] << 15 b2[6] << 15
paddd mm0, mm7                // do proper rounding
psrad mm0, 15                 // b2[5] b2[6]
packssdw mm0, mm1             // mm0 = x x b2[5] b2[6] (B*)

pshufw mm5, mm2, 01000100b    // b2[1] b2[0] b2[1] b2[0]
pmaddwd mm5, xconst2          // o4 << 15 o0 << 15
pshufw mm2, mm3, 01000100b    // b2[2] b2[3] b2[2] b2[3]
pmaddwd mm2, xconst3          // o6 << 15 o2 << 15
paddd mm5, mm7                // do proper rounding
paddd mm2, mm7                // do proper rounding
psrad mm5, 15
psrad mm2, 15
packssdw mm5, mm2             // mm5 = o6 o2 o4 o0 (Y*)
movq mm1, mm4
paddsw mm4, mm0               // x x b3[4] b3[7]
psubsw mm1, mm0               // x x b3[5] b3[6]
punpcklwd mm4, mm1            // b3[5] b3[4] b3[6] b3[7]
pshufw mm3, mm4, 11011000b    // mm3 = b3[5] b3[6] b3[4] b3[7] (C*)

movq mm4, mm3
pmaddwd mm3, xconst4          // o5 << 15 o1 << 15
pmaddwd mm4, xconst5          // o3 << 15 o7 << 15
paddd mm3, mm7                // do proper rounding
paddd mm4, mm7                // do proper rounding
psrad mm3, 15
psrad mm4, 15
packssdw mm3, mm4             // mm3 = o3 o7 o5 o1 (Y*)

pshufw mm0, mm5, 11011000b    // mm0 = o6 o4 o2 o0
pshufw mm1, mm3, 10011100b    // mm1 = o7 o5 o3 o1
movq mm2, mm0
punpcklwd mm2, mm1            // o3 o2 o1 o0
punpckhwd mm0, mm1            // o7 o6 o5 o4
mov eax, out
movq [eax], mm2
movq [eax+8], mm0
}
}
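The final `punpcklwd`/`punpckhwd` pair in fsadct8_mmx interleaves the even-indexed outputs (o0 o2 o4 o6) with the odd-indexed outputs (o1 o3 o5 o7) to restore natural order. A scalar sketch of that interleave, using a hypothetical helper name, is:

```c
#include <stdint.h>

/* Interleave four even-indexed and four odd-indexed coefficients into
   natural order, as the punpcklwd/punpckhwd pair does for two registers. */
static void interleave_even_odd(const int16_t even[4], const int16_t odd[4],
                                int16_t out[8])
{
    for (int k = 0; k < 4; ++k) {
        out[2 * k]     = even[k];  /* o0, o2, o4, o6 */
        out[2 * k + 1] = odd[k];   /* o1, o3, o5, o7 */
    }
}
```

`punpcklwd` produces the lower half of this result (o3 o2 o1 o0) and `punpckhwd` the upper half (o7 o6 o5 o4).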



//////////////////////////////////////////// 

void fsaidct2_mmx (short in[2], short out[2])
{
static __int64 xconst1 = 0xA57E5A825A825A82;  // -f0 f0 f0 f0
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
mov ecx, out
movd mm0, [eax]
pshufw mm1, mm0, 01000100b    // i1 i0 i1 i0
pmaddwd mm1, xconst1          // o1 << 15 o0 << 15
paddd mm1, rounding           // do proper rounding
psrad mm1, 15
packssdw mm1, mm7             // x x o1 o0
movd [ecx], mm1
}
}
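For comparison, the arithmetic that fsaidct2_mmx performs with a single `pmaddwd` can be written in scalar C. This is a sketch derived from reading the listing, not code from the original: the name `fsaidct2_ref` is hypothetical, F0_2 is the 0x5A82 word of xconst1, the saturation applied by `packssdw` is omitted, and `>>` on negative values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

enum { F0_2 = 0x5A82 };  /* f0 word of xconst1 in fsaidct2_mmx (Q15) */

/* Scalar sketch of the 2-point inverse transform:
   o0 = (i0 + i1) * f0, o1 = (i0 - i1) * f0, rounded from Q15. */
static void fsaidct2_ref(const int16_t in[2], int16_t out[2])
{
    int32_t sum  = (int32_t)in[0] * F0_2 + (int32_t)in[1] * F0_2;
    int32_t diff = (int32_t)in[0] * F0_2 - (int32_t)in[1] * F0_2;
    out[0] = (int16_t)((sum  + 0x4000) >> 15);
    out[1] = (int16_t)((diff + 0x4000) >> 15);
}
```

The two sums correspond to the low and high doublewords produced by `pmaddwd mm1, xconst1`.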



void fsaidct3_mmx (short in[3], short out[3])
{
static __int64 xconst1 = 0x49E7977D49E73441;  // f0 -f3 f0 f2
static __int64 xconst2 = 0x0000A57E00005A82;  // 0 -f1 0 f1
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movd mm0, [eax]               // 0 0 i1 i0
movd mm1, [eax+2]             // 0 0 i2 i1
mov eax, out
movq mm7, rounding
psllq mm0, 32
paddd mm0, mm1                // i1 i0 i2 i1
pshufw mm1, mm0, 10011001b    // i0 i2 i0 i2
pmaddwd mm1, xconst1          // o1 << 15 b2 << 15
pshufw mm2, mm0, 11111111b    // i1 i1 i1 i1
pmaddwd mm2, xconst2          // -b1 << 15 b1 << 15
pshufw mm3, mm1, 01000100b    // b2 << 15 b2 << 15
paddd mm2, mm3                // o2 << 15 o0 << 15
paddd mm1, mm7                // do proper rounding
paddd mm2, mm7
psrad mm1, 15                 // x o1 x x
psrad mm2, 15                 // x o2 x o0
movd [eax], mm2               // store o0
packssdw mm1, mm2             // o2 x o1 x
pshufw mm2, mm1, 11111101b    // x x o2 o1
movd [eax+2], mm2             // store o1, o2
}
}

void fsaidct4_mmx (short in[4], short out[4])
{
static __int64 xconst1 = 0xC000400040004000;  // -f0 f0 f0 f0
static __int64 xconst2 = 0xAC6122A322A3539F;  // -f2 f1 f1 f2
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm7, rounding
movq mm0, [eax]               // i3 i2 i1 i0
mov eax, out
pshufw mm1, mm0, 10001000b    // i2 i0 i2 i0
pmaddwd mm1, xconst1          // mm1 = b[1] << 15 b[0] << 15
pshufw mm2, mm0, 11011101b    // i3 i1 i3 i1
pmaddwd mm2, xconst2          // mm2 = b[2] << 15 b[3] << 15
movq mm3, mm1
paddd mm1, mm2                // o1 << 15 o0 << 15
psubd mm3, mm2                // o2 << 15 o3 << 15
paddd mm1, mm7                // do proper rounding
paddd mm3, mm7
psrad mm1, 15
psrad mm3, 15
packssdw mm1, mm3             // o2 o3 o1 o0
pshufw mm2, mm1, 10110100b    // o3 o2 o1 o0
movq [eax], mm2
}
}
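The data flow of fsaidct4_mmx can be followed more easily in scalar form. The sketch below is derived from the listing's comments and constants rather than taken from the original: `fsaidct4_ref` and `rnd15` are hypothetical names, the constant names follow the f0/f1/f2 labels in the xconst comments, saturation by `packssdw` is omitted, and `>>` on negative values is assumed to be arithmetic.

```c
#include <stdint.h>

/* Q15 words taken from xconst1 and xconst2 in fsaidct4_mmx. */
enum { F0_4 = 0x4000, F1_4 = 0x22A3, F2_4 = 0x539F };

static int16_t rnd15(int32_t acc)
{
    return (int16_t)((acc + 0x4000) >> 15);
}

/* Scalar sketch of the 4-point inverse transform butterfly. */
static void fsaidct4_ref(const int16_t in[4], int16_t out[4])
{
    int32_t b0 = ((int32_t)in[0] + in[2]) * F0_4;               /* even part */
    int32_t b1 = ((int32_t)in[0] - in[2]) * F0_4;
    int32_t b3 = (int32_t)in[1] * F2_4 + (int32_t)in[3] * F1_4; /* odd part */
    int32_t b2 = (int32_t)in[1] * F1_4 - (int32_t)in[3] * F2_4;
    out[0] = rnd15(b0 + b3);
    out[1] = rnd15(b1 + b2);
    out[2] = rnd15(b1 - b2);
    out[3] = rnd15(b0 - b3);
}
```

The even and odd parts correspond to the two `pmaddwd` results, and the final sums and differences to the `paddd`/`psubd` pair.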



void fsaidct5_mmx (short in[5], short out[5])
{
static __int64 xconst1 = 0xB3022F952F954CFE;  // -f1 f2 f2 f1
static __int64 xconst2 = 0xBE82E6FC1904417E;  // -f3 -f4 f4 f3
static __int64 xconst3 = 0x0000393E0000393E;  // 0 f0 0 f0
static __int64 xconst4 = 0x0000000050F4AF0C;  // 0 0 f5 -f5
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax+2]             // mm0 = i4 i3 i2 i1
movd mm6, [eax]               // x x i1 i0
movq mm7, rounding
pshufw mm1, mm0, 10001000b    // i3 i1 i3 i1
pmaddwd mm1, xconst1          // mm1 = b2 << 15 b1 << 15
pshufw mm2, mm0, 11011101b    // i4 i2 i4 i2
movq mm5, mm2
pmaddwd mm2, xconst2          // b4 << 15 b3 << 15
pshufw mm3, mm6, 00000000b    // i0 i0 i0 i0
pmaddwd mm3, xconst3          // mm3 = b0 << 15 b0 << 15
paddd mm2, mm3                // mm2 = b4 << 15 b3 << 15
movq mm4, mm2
paddd mm4, mm1                // o1 << 15 o0 << 15
psubd mm2, mm1                // o3 << 15 o4 << 15
pmaddwd mm5, xconst4          // x (i4 - i2) * f5
paddd mm5, mm3                // x o2 << 15
mov eax, out
paddd mm4, mm7                // do proper rounding
paddd mm2, mm7
paddd mm5, mm7
psrad mm4, 15
psrad mm2, 15
psrad mm5, 15
packssdw mm4, mm2             // o3 o4 o1 o0
packssdw mm5, mm7             // x x x o2
movd [eax], mm4               // store o0, o1
pshufw mm3, mm4, 00001011b    // x x o4 o3
movd [eax+4], mm5             // store o2
movd [eax+6], mm3             // store o3, o4
}
}

042390.P8657 Express Mail No. EM522828778US
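Each `xconst` value packs four signed Q15 coefficients into one 64-bit constant, listed high word first in the accompanying comment. The following sketch unpacks such a constant so the signs and magnitudes can be checked against the comment; `unpack_words` is a hypothetical helper, not part of the original listing.

```c
#include <stdint.h>

/* Split a 64-bit packed constant into four signed 16-bit words.
   w[0] is the least significant word, so the comment order
   "0 0 f5 -f5" corresponds to w[3], w[2], w[1], w[0]. */
static void unpack_words(uint64_t packed, int16_t w[4])
{
    for (int k = 0; k < 4; ++k) {
        int32_t v = (int32_t)((packed >> (16 * k)) & 0xFFFFu);
        if (v >= 0x8000)
            v -= 0x10000;  /* reinterpret as two's-complement */
        w[k] = (int16_t)v;
    }
}
```

Applied to xconst4 of fsaidct5_mmx (0x0000000050F4AF0C), this gives -20724, 20724, 0, 0 from low word to high word, matching the comment `0 0 f5 -f5`.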



void fsaidct6_mmx (short in[6], short out[6])
{
static __int64 xconst1 = 0xC000344140003441;  // -f1 f0 f1 f0
static __int64 xconst2 = 0x000024F33441B619;  // 0 f2 f0 -f3
static __int64 xconst3 = 0x4762132013204762;  // f5 f4 f4 f5
static __int64 xconst4 = 0x344100003441CBBF;  // f0 0 f0 -f0
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax]               // i3 i2 i1 i0
movd mm1, [eax+8]             // 0 0 i5 i4
punpcklwd mm1, mm0            // i1 i5 i0 i4
pshufw mm2, mm0, 10001000b    // i2 i0 i2 i0
pmaddwd mm2, xconst1          // b0-b1 << 15 b0+b1 << 15
pshufw mm3, mm1, 00000100b    // i4 i4 i0 i4
pmaddwd mm3, xconst2          // mm3 = (b2) << 15 b4 << 15
pshufw mm4, mm3, 11101110b    // b2 << 15 b2 << 15
paddd mm2, mm4                // mm2 = b5 << 15 b3 << 15
pshufw mm4, mm1, 11101110b    // i1 i5 i1 i5
mov eax, out
pmaddwd mm4, xconst3          // i1*f5+i5*f4 << 15 i1*f4+i5*f5 << 15
pextrw ecx, mm1, 2
movq mm7, rounding
pinsrw mm0, ecx, 0            // i3 i2 i1 i5
pmaddwd mm0, xconst4          // b2 << 15 (i1-i5)*f0 << 15
pshufw mm1, mm0, 11101110b    // b2 << 15 b2 << 15
movq mm6, mm1
paddd mm1, mm4                // b0 << 15 x
psubd mm4, mm6                // x b2 << 15
psubd mm0, mm6                // mm0 = x b1 << 15
psrlq mm1, 32
psllq mm4, 32
paddd mm1, mm4                // mm1 = b2 << 15 b0 << 15
movq mm5, mm1
movq mm6, mm0
paddd mm1, mm2                // o2 << 15 o0 << 15
psubd mm2, mm5                // o3 << 15 o5 << 15
paddd mm0, mm3                // x o1 << 15
psubd mm3, mm6                // x o4 << 15
paddd mm0, mm7                // do proper rounding
paddd mm1, mm7
paddd mm2, mm7
paddd mm3, mm7
psrad mm0, 15
psrad mm1, 15
psrad mm2, 15
psrad mm3, 15
packssdw mm1, mm0             // x o1 o2 o0
packssdw mm2, mm3             // x o4 o3 o5
psllq mm1, 16                 // o1 o2 o0 0
pshufw mm4, mm2, 01010010b    // o3 o3 o5 o4
pextrw ecx, mm4, 3
pinsrw mm1, ecx, 0            // o1 o2 o0 o3
pshufw mm3, mm1, 00101101b    // o3 o2 o1 o0
movq [eax], mm3               // store o0, o1, o2, o3
movd [eax+8], mm4             // store o4, o5
}
}



void fsaidct7_mmx (short in[7], short out[7])
{
static __int64 xconst1 = 0x357EE25042B4357E;  // f3 -f5 f1 f3
static __int64 xconst2 = 0x0F39C25B3DA52AA9;  // f6 -f2 f2 f4
static __int64 xconst3 = 0x1DB0BD4CD557F0C7;  // f5 -f1 -f4 -f6
static __int64 xconst4 = 0x0000BD4C00001DB0;  // 0 -f1 0 f5
static __int64 xconst5 = 0x3061D55730610F39;  // f0 -f4 f0 f6
static __int64 xconst6 = 0x0000357E30613DA5;  // 0 f3 f0 f2
static __int64 xconst7 = 0x446BBB953061BB95;  // f7 -f7 f0 -f7
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax+2]             // i4 i3 i2 i1
pshufw mm6, mm0, 00100010b    // i1 i3 i1 i3
pmaddwd mm6, xconst1          // i1*f3-i3*f5 i1*f1+i3*f3
                              // (b2) (b1)
pshufw mm5, mm0, 01110111b    // i2 i4 i2 i4
pmaddwd mm5, xconst2          // i2*f6-i4*f2 i2*f2+i4*f4
                              // (b5) (b4)
pshufw mm4, mm0, 00100111b    // i1 i3 i2 i4
pmaddwd mm4, xconst3          // i1*f5-i3*f1 -i2*f4-i4*f6
                              // (b3) (b6)

mov ecx, [eax]                // i1 i0
movd mm7, [eax+10]            // 0 0 i6 i5
mov eax, out
pinsrw mm7, ecx, 3            // i0 0 i6 i5
pshufw mm3, mm7, 00000000b    // i5 i5 i5 i5
pmaddwd mm3, xconst4
paddd mm3, mm6                // mm3 = b2 << 15 b1 << 15
pshufw mm2, mm7, 11011101b    // i0 i6 i0 i6
pmaddwd mm2, xconst5
paddd mm2, mm5                // mm2 = b5 << 15 b4 << 15
pshufw mm1, mm7, 00001101b    // i5 i5 i0 i6
pmaddwd mm1, xconst6
paddd mm1, mm4                // mm1 = b3 << 15 b6 << 15
pshufw mm4, mm7, 10101101b    // 0 0 i0 i6
movq mm7, rounding
pshufw mm5, mm0, 11111101b    // x x i4 i2
psllq mm5, 32
paddd mm4, mm5                // i4 i2 i0 i6
pmaddwd mm4, xconst7
pshufw mm5, mm4, 00001110b    // x x i4 i2
paddd mm4, mm5                // x o3 << 15

movq mm5, mm2
paddd mm5, mm3                // mm5 = o1 << 15 o0 << 15
psubd mm2, mm3                // mm2 = o5 << 15 o6 << 15
pshufw mm3, mm1, 00001110b    // x b3 << 15
movq mm6, mm1
paddd mm1, mm3                // x o2 << 15
psubd mm6, mm3                // mm6 = x o4 << 15
psllq mm4, 32
psllq mm1, 32
psrlq mm1, 32
paddd mm1, mm4                // mm1 = o3 << 15 o2 << 15

paddd mm1, mm7                // do proper rounding
paddd mm2, mm7
paddd mm5, mm7
paddd mm6, mm7
psrad mm1, 15
psrad mm2, 15
psrad mm5, 15
psrad mm6, 15
packssdw mm5, mm1             // o3 o2 o1 o0
packssdw mm2, mm6             // x o4 o5 o6
pshufw mm3, mm2, 00010010b    // o6 o5 x o4

movq [eax], mm5               // store o0, o1, o2, o3
movd [eax+8], mm3             // store o4
psrlq mm3, 32
movd [eax+10], mm3            // store o5, o6
}
}



void fsaidct8_mmx (short in[8], short out[8])
{
static __int64 xconst1 = 0x0C7C3EC5C13B0C7C;  // f7 f1 -f1 f7
static __int64 xconst2 = 0x238E35363536DC72;  // f5 f3 f3 -f5
static __int64 xconst3 = 0x2D41D2BF2D412D41;  // f4 -f4 f4 f4
static __int64 xconst4 = 0x3B20187D187DC4E0;  // f2 f6 f6 -f2
static __int64 xconst5 = 0x00005A8200005A82;  // 0 f0 0 f0
static __int64 rounding = 0x0000400000004000;

__asm {
mov eax, in
movq mm0, [eax]               // i3 i2 i1 i0
movq mm1, [eax+8]             // i7 i6 i5 i4
mov eax, out

pshufw mm2, mm0, 10001101b    // i2 i0 i3 i1
pshufw mm3, mm1, 00100111b    // i4 i6 i5 i7
movq mm1, mm2
punpcklwd mm2, mm3            // i5 i3 i7 i1
punpckhwd mm1, mm3            // i4 i2 i6 i0

pshufw mm3, mm2, 01000100b    // i7 i1 i7 i1
pmaddwd mm3, xconst1          // b3[1] b3[0]
pshufw mm4, mm2, 11101110b    // i5 i3 i5 i3
pmaddwd mm4, xconst2          // b3[3] b3[2]

pshufw mm0, mm1, 00110011b    // i0 i4 i0 i4
pmaddwd mm0, xconst3          // mm0 = b1 b0 (*A)
pshufw mm2, mm1, 10011001b    // i2 i6 i2 i6
pmaddwd mm2, xconst4
pshufw mm5, mm2, 01001110b    // mm5 = b2 b3 (*A)

movq mm1, mm3
psubd mm3, mm4                // b6 << 15 b5 << 15
movq mm7, mm3
punpckhdq mm7, mm6            // x b6 << 15
movq mm2, mm7
paddd mm7, mm3                // x b[6] + b[5]
psubd mm2, mm3                // x b[6] - b[5]
punpckldq mm2, mm7            // b6+b5 b6-b5
movq mm7, rounding
paddd mm2, mm7
psrad mm2, 15

paddd mm1, mm4                // b2[7] b2[4]
movq mm4, mm0
paddd mm0, mm5                // mm0 = b2[1] b2[0] (*C)
psubd mm4, mm5
pshufw mm6, mm4, 01001110b    // mm6 = b2[3] b2[2] (*C)
pmaddwd mm2, xconst5          // b2[6] b2[5]
movq mm5, mm2
punpckldq mm2, mm1            // mm2 = b2[4] b2[5] (*C)
punpckhdq mm1, mm5            // mm1 = b2[6] b2[7] (*C)

movq mm3, mm0
paddd mm0, mm1                // o1 << 15 o0 << 15
psubd mm3, mm1                // o6 << 15 o7 << 15
movq mm5, mm6
paddd mm6, mm2                // o3 << 15 o2 << 15
psubd mm5, mm2                // o4 << 15 o5 << 15

paddd mm0, mm7                // do proper rounding
paddd mm3, mm7
paddd mm6, mm7
paddd mm5, mm7
psrad mm0, 15
psrad mm3, 15
psrad mm6, 15
psrad mm5, 15

packssdw mm0, mm6             // o3 o2 o1 o0
packssdw mm3, mm5             // o4 o5 o6 o7
pshufw mm1, mm3, 00011011b    // o7 o6 o5 o4
movq [eax], mm0
movq [eax+8], mm1
}
}